2/4/2018

Why drug crime?

  • Cooler than parking tickets
  • Access to ~31,000 observations, over 5 years!
  • Have you seen The Wire?

Goals of this talk

  • Show distribution of crime in our city
  • Provide an open-template for analyzing more data
  • Get people excited about using R

Data Process

  1. Open Data Portal - Crime Reports
  2. Geocode via Google Map API
  3. Use library(sf) %>% tidyverse for spatial data exploration
  4. Model and test patterns with library(spdep) and glm()
library(tidyverse) # duh
library(magrittr) # %<>% life
library(viridis) # it's pretty
library(sf) # the new kid
library(spdep) # the grandaddy

Where is crime happening?

Most frequent addresses

Top 3 address for drug crime and not drug crime

arrange(crime_counts, -n) %>% group_by(drug_flag) %>% slice(1:3)
## # A tibble: 6 x 3
## # Groups:   drug_flag [2]
##                               address drug_flag     n
##                                 <chr>     <chr> <int>
## 1  600 E MARKET ST Charlottesville VA     drugs   410
## 2   400 GARRETT ST Charlottesville VA     drugs    38
## 3 700 PROSPECT AVE Charlottesville VA     drugs    38
## 4  600 E MARKET ST Charlottesville VA not_drugs   635
## 5 700 PROSPECT AVE Charlottesville VA not_drugs   412
## 6   1100 5TH ST SW Charlottesville VA not_drugs   341

The police station's address is 606 E Market Street….

What is going on at the police station?

"The answer is quite simple - when individuals walk in to the police department to file a report the physical address of the department (606 E Market Street) is often used in that initial report if no other known address is available at the time. This is especially true for incidents of found or lost property near the downtown mall where there is no true known incident location. The same is true for any warrant services that result in a police report occurring at the police department." - CPD

% group_by(drug_flag) %>% add_count(wt = n) %>% slice(1)

with(station_props, prop.test(n, nn)) %>% tidy



estimate1 estimate2 statistic p.value parameter conf.low conf.high

1 0.2221018 0.02175626 2135.459 0 1 0.1810225 0.2196687

method

1 2-sample test for equality of proportions with continuity correction

alternative

1 two.sided

```

Aggregate into areas

Census blocks make a lot of sense becuase:

  • Tons of data in Census and American Community Surveys
  • Reputable source with code books and APIs
  • Easy to access in R via ODP and library(tidycensus)

Start to do it

long_url <- "https://opendata.arcgis.com/datasets/e60c072dbb734454a849d21d3814cc5a_14.geojson"
census <- geojsonio::geojson_read(long_url, what = "sp") %>%
    st_as_sf()

ggplot(census, aes(fill = HU_Vacant / Housing_Units)) +
    geom_sf() + scale_fill_viridis()

Keep doing it

  • Start with geocoded version
crime <- read_csv("https://github.com/NathanCDay/cville_crime/raw/master/crime_geocode.csv")
crime %<>% filter(complete.cases(.))
crime %<>% filter(address != "600 E MARKET ST Charlottesville VA")
  • Convert to sf, with same CRS (critical)
crime %<>% st_as_sf(coords = c("lon", "lat"), crs = st_crs(census))
  • Use sf::st_within() and friends
crime %<>% mutate(within = st_within(crime, census) %>% as.numeric) %>% 
    filter(!is.na(within))

There are bunch of other great st_x(sf_a, sf_b) functions too. If you want to do it, there's a tool for it.

Done with it

  • Flag by interest
crime %<>% mutate(drug_flag = ifelse(grepl("drug", Offense, ignore.case = TRUE),
                                     "drugs", "not_drugs"))
  • Summarise with tidyverse
crime_block <- st_set_geometry(crime, NULL) %>% # remove geometry for spread() to work
    group_by(within, drug_flag) %>%
    count() %>%
    spread(drug_flag, n) %>%
    mutate(frac_drugs = drugs / sum(drugs + not_drugs)) %>%
    ungroup() # geom_sf doesn't care for grouped dfs/tbls
  • Join in
census %<>% inner_join(crime_block, by = c("OBJECTID" = "within"))

Hot blocks

ggplot(census, aes(fill = frac_drugs)) +
    geom_sf() + scale_fill_viridis()

Is it random?

Test with Moran's I statistic

##   statistic p.value parameter                            method
## 1 0.2129707   0.024       976 Monte-Carlo simulation of Moran I
##   alternative
## 1     greater

Get more data

Get median income and age data from the American Community Survey via tidycensus to supplement housing and demographics from the Census.

What are the strong predictors?

Since the response is the proportion of drug crime to total, i.e. bounded by 0 and 1, we want to use the quasibinomial distribution instead of the Gaussian.

##                  Estimate   Std. Error     t value     Pr(>|t|)
## (Intercept) -3.351612e+00 3.521631e-01 -9.51721491 7.500572e-11
## frac_black   1.178626e+00 3.948554e-01  2.98495478 5.397890e-03
## frac_vacant -1.181078e-01 2.220301e+00 -0.05319451 9.579076e-01
## income      -5.789340e-06 3.566446e-06 -1.62327999 1.143414e-01
## age          8.641429e-03 9.119871e-03  0.94753847 3.504655e-01

Test resid(mod) again for spatial correlation, a good model should have randomly dispearsed residuals.

##   statistic p.value parameter                            method
## 1 -0.156563   0.918        82 Monte-Carlo simulation of Moran I
##   alternative
## 1     greater

Questions?

Thanks for listening